18  Joins

Today is going to be a few examples riffing on class Group Projects

18.1 Load libraries

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE

18.2 Southern California City Boundaries

Many groups are looking at areas of Southern California. Let’s grab a layer of city boundaries from SCAG.

Doing this one time, point and click is ok. Here are the steps.

  • Go to the linked SCAG URL.
  • Click the Download button.
  • Choose a file type - I’d suggest GeoJSON
  • Press Download Options - select ’Download file previously generated on…`
  • Download the file - mine was named City_Boundaries_–_SCAG_Region.geojson
  • Move it to your working directory
  • Read in the geojson file using read_sf() - (see code below)
  • Transform to WGS84 coordinate reference system (see code chunk below)
  • Make a map (see code chunk below)
#read the data
IE_cities <- read_sf(dsn = 'City_Boundaries_–_SCAG_Region.geojson') |>  
 st_transform(crs = 4326) |> 
 # Let's only include the cities in Riverside and San Bernardino
 filter(COUNTY %in% c('Riverside', 'San Bernardino')) |>
  # remove the very far away desert cities and unincorporated county areas
  filter(CITY != 'Unincorporated' & CITY != 'Needles' & CITY != 'Blythe') #|> 

head(IE_cities)
Simple feature collection with 6 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -117.4687 ymin: 34.00393 xmax: -115.9743 ymax: 34.64545
Geodetic CRS:  WGS 84
# A tibble: 6 × 13
  OBJECTID CITY             CITY_ID COUNTY       ACRES PERIMETER  YEAR COUNTY_ID
     <int> <chr>            <chr>   <chr>        <dbl>     <dbl> <int>     <int>
1      202 Yucaipa          87042   San Bernar… 18069.    55151.  2016        71
2      203 Grand Terrace    30658   San Bernar…  2269.    18101.  2016        71
3      204 Twentynine Palms 80994   San Bernar… 37609.    88197.  2016        71
4      205 Colton           14890   San Bernar… 10313.    70632.  2016        71
5      206 Yucca Valley     87056   San Bernar… 25468.    52659.  2016        71
6      208 Victorville      82590   San Bernar… 47356.   117505.  2016        71
# ℹ 5 more variables: ANNEX_DATE <dttm>, ANNEX_NOTE <chr>, Shapearea <dbl>,
#   Shapelen <dbl>, geometry <MULTIPOLYGON [°]>

Figure 18.1 shows a map of Cities using ggplot and geom_sf

IE_cities |> 
  ggplot() +
  geom_sf() +
  theme_void() + 
  geom_sf_text(aes(label = CITY), size = 1.5) +
  labs(x = '', y = '')

Figure 18.1: California Counties

Figure 18.2 shows a leaflet map of the cities, zoomed into the main valley.

IE_cities |> 
  #filter(CITY != 'Unincorporated') |> 
  leaflet() |> 
  addTiles() |> 
  addPolygons(color = 'black',
              weight = 1,
              fillOpacity = 0.3,
              label = ~CITY)
Figure 18.2: IE cities on a leaflet map

18.2.1 Database Joins - data science lesson #1

Dataset IE_cities has the geospatial information on city boundaries. Some of the other datasets you are using have the information you would like to compare. How can we merge or join these disparate data layers for visualization and analysis?

There are two types of joins that can help to merge disparate datasets.

The first type of join is based on unique records in both datasets. If you have a column in dataset A and it has some matching records in dataset B, there are ways to join the two datasets.

Here’s all the permutations. - inner_join() - keep only records present in both dataset A and B. - left_join() - keep all records from the dataset A and only matching variables from B. Fill missing with NA. - right_join() - keep all records from the dataset B and only matching variables from A. Fill missing with NA. - full_join() - keep all records from A and B - fill NA in any records where dataset is missing from one of the datasets. - anti_join() - keep only records from A that don’t occur in B.

The example below shows Venn diagrams of all the permutations.

Join Venn Diagrams - credit Tavareshugo github

18.3 Spatial Joins - Data science lesson #2

More interesting and probably more useful for your group projects are spatial joins.

Spatial joins use geospatial information to identify overlapping geometries in space. Thus, if one has two geospatial datasets, one can quickly identify intersections of interest.

Let’s try a simple example using the community gardens dataset and the IE_cities dataset.

First, import the community gardens dataset, which we worked on in class a few weeks ago.

gardens <- st_read(dsn = 'CommunityGardens.geojson') |> 
  st_transform(crs = 4326)
Reading layer `CommunityGardens' from data source 
  `C:\Dev\EA078_Fall2023\CommunityGardens.geojson' using driver `GeoJSON'
Simple feature collection with 186 features and 12 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -119.2673 ymin: 32.78017 xmax: -115.5097 ymax: 34.52903
Geodetic CRS:  WGS 84
head(gardens)
Simple feature collection with 6 features and 12 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -119.1049 ymin: 32.78017 xmax: -115.5097 ymax: 34.22143
Geodetic CRS:  WGS 84
    County       City             Type
1 Imperial Calipatria    School Garden
2 Imperial   Imperial Community Garden
3 Imperial  El Centro    School Garden
4 Imperial    Brawley    School Garden
5 Imperial  El Centro    School Garden
6  Ventura  Camarillo   Urban Farm/CSA
                                       Name
1        Best STEP Forward Community Garden
2       Pacific SouthWest Community Gardens
3 ICOE Little Roadrunner's Preschool Garden
4             Sacred Heart Preschool Garden
5           ICOE Kids R Us Preschool Garden
6                        The Abundant Table
                                   Address      LAT      LONG
1 210 N Railroad, Calipatria, CA, CA 92233 33.12725 -115.5097
2     2385 Myrtle Road, Imperial, CA 92251 32.82423 -115.5733
3       200 N 12th St, El Centro, CA 92243 32.79472 -115.5660
4                        Brawley, CA 92227 32.97450 -115.5347
5 1528 S Waterman Ave, El Centro, CA 92243 32.78018 -115.5749
6 1012 W Ventura Blvd, Camarillo, CA 93010 34.22144 -119.1049
                                                                                Source
1                                                     http://www.beststepforward.life/
2 https://pacificsouthwest.wixsite.com/pswcdc/single-post/2017/06/07/community-gardens
3              https://www.icoe.org/news/calfresh-imperial-county-and-ecep-garden-kick
4                                        https://www.iccopa.com/community-gardens.html
5                                        https://www.iccopa.com/community-gardens.html
6                                         https://www.venturacountyfarmday.com/mcgrath
  F9 F10 F11 ORIG_FID                   geometry
1                   1 POINT (-115.5097 33.12724)
2                   2 POINT (-115.5733 32.82422)
3                   3  POINT (-115.566 32.79471)
4                   4  POINT (-115.5347 32.9745)
5                   5 POINT (-115.5748 32.78017)
6                   6 POINT (-119.1049 34.22143)

Now this dataset is pretty easy to manipulate because it already has a City field indicated. But maybe you don’t know all the IE cities or counties? Geospatially, we can use sf::st_filter() to select any garden locations that intersect the city boundaries in the IE.

#This is necessary to prevent an error with non-planar geometry that happens in sf
sf_use_s2(FALSE)
Spherical geometry (s2) switched off
gardens |> 
  st_filter(IE_cities) |> 
  leaflet() |> 
  addTiles() |> 
  addAwesomeMarkers(label = ~Name)
although coordinates are longitude/latitude, st_intersects assumes that they
are planar
although coordinates are longitude/latitude, st_intersects assumes that they
are planar

18.4 Example 2 - Average biodiversity by city

The first example was easy. The second is going to be two steps - a spatial join, then a summary of statistical values.

First grab the biodiveristy dataset.

biodiversity <- st_read(dsn = 'Biodiversity.geojson') |> 
  st_transform(crs = 4326) |> 
  st_zm()
Reading layer `Biodiversity' from data source 
  `C:\Dev\EA078_Fall2023\Biodiversity.geojson' using driver `GeoJSON'
Simple feature collection with 2015 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -119.4784 ymin: 32.61859 xmax: -114.1312 ymax: 35.80909
Geodetic CRS:  WGS 84

Now we’re going to filter it just using the same cities, and make a map of biodiversity within the cities. Figure 18.3 shows that map.

palBD <- colorFactor(domain = biodiversity$RANK, palette = 'Spectral')

biodiversity |> 
  st_filter(IE_cities, .predicate = st_intersects) |> 
  leaflet() |> 
  addProviderTiles(provider = providers$CartoDB.Positron) |> 
  addPolygons(color = ~palBD(RANK),
              weight = 1) |> 
  addLegend(title = 'Biodiversity',
            values = ~RANK,
            pal = palBD) |> 
  addPolygons(data = IE_cities,
              color = 'black',
              weight = 1,
              fillOpacity = 0)
although coordinates are longitude/latitude, st_intersects assumes that they
are planar